BISMO: A Scalable Bit-Serial Matrix Multiplication Overlay for Reconfigurable Computing
Matrix-matrix multiplication is a key computational kernel for numerous
applications in science and engineering, with ample parallelism and data
locality that lends itself well to high-performance implementations. Many
matrix multiplication-dependent applications can use reduced-precision integer
or fixed-point representations to increase their performance and energy
efficiency while still offering adequate quality of results. However, precision
requirements may vary between different application phases or depend on input
data, rendering constant-precision solutions ineffective. We present BISMO, a
vectorized bit-serial matrix multiplication overlay for reconfigurable
computing. BISMO utilizes the excellent binary-operation performance of FPGAs
to offer a matrix multiplication performance that scales with required
precision and parallelism. We characterize the resource usage and performance
of BISMO across a range of parameters to build a hardware cost model, and
demonstrate a peak performance of 6.5 TOPS on the Xilinx PYNQ-Z1 board.Comment: To appear at FPL'1
FINN: A Framework for Fast, Scalable Binarized Neural Network Inference
Research has shown that convolutional neural networks contain significant
redundancy, and high classification accuracy can be obtained even when weights
and activations are reduced from floating point to binary values. In this
paper, we present FINN, a framework for building fast and flexible FPGA
accelerators using a flexible heterogeneous streaming architecture. By
utilizing a novel set of optimizations that enable efficient mapping of
binarized neural networks to hardware, we implement fully connected,
convolutional and pooling layers, with per-layer compute resources being
tailored to user-provided throughput requirements. On a ZC706 embedded FPGA
platform drawing less than 25 W total system power, we demonstrate up to 12.3
million image classifications per second with 0.31 µs latency on the MNIST
dataset with 95.8% accuracy, and 21906 image classifications per second with
283 µs latency on the CIFAR-10 and SVHN datasets with 80.1% and 94.9%
accuracy, respectively. To the best of our knowledge, ours are the fastest
classification rates reported to date on these benchmarks.
Comment: To appear in the 25th International Symposium on Field-Programmable Gate Arrays, February 2017.
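The binarized layers FINN targets reduce dot products to XNOR and popcount operations; the plain-Python sketch below (illustrative, not FINN's implementation) shows the standard equivalence between a {-1,+1} dot product and an XNOR-popcount over bit-packed operands.

```python
import random

def xnor_popcount_dot(x_bits, w_bits, n):
    """Dot product of two length-n {-1,+1} vectors, each packed into an
    integer with bit 1 encoding +1 and bit 0 encoding -1."""
    mask = (1 << n) - 1
    agree = bin(~(x_bits ^ w_bits) & mask).count("1")   # XNOR, then popcount
    return 2 * agree - n                                # agreements minus disagreements

# Usage: matches the plain {-1,+1} dot product
random.seed(1)
n = 16
x = [random.choice([-1, +1]) for _ in range(n)]
w = [random.choice([-1, +1]) for _ in range(n)]
pack = lambda v: sum(1 << i for i, b in enumerate(v) if b == +1)
assert xnor_popcount_dot(pack(x), pack(w), n) == sum(a * b for a, b in zip(x, w))
```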
Optimizing Bit-Serial Matrix Multiplication for Reconfigurable Computing
Matrix-matrix multiplication is a key computational kernel for numerous
applications in science and engineering, with ample parallelism and data
locality that lends itself well to high-performance implementations. Many
matrix multiplication-dependent applications can use reduced-precision integer
or fixed-point representations to increase their performance and energy
efficiency while still offering adequate quality of results. However, precision
requirements may vary between different application phases or depend on input
data, rendering constant-precision solutions ineffective. BISMO, a vectorized
bit-serial matrix multiplication overlay for reconfigurable computing,
previously utilized the excellent binary-operation performance of FPGAs to
offer a matrix multiplication performance that scales with required precision
and parallelism. We show how BISMO can be scaled up on Xilinx FPGAs using an
arithmetic architecture that better utilizes 6-LUTs. The improved BISMO
achieves a peak performance of 15.4 binary TOPS on the Ultra96 board with a
Xilinx UltraScale+ MPSoC.Comment: Invited paper at ACM TRETS as extension of FPL'18 paper
arXiv:1806.0886
Ps and Qs: Quantization-aware pruning for efficient low latency neural network inference
Efficient machine learning implementations optimized for inference in
hardware have wide-ranging benefits, depending on the application, from lower
inference latency to higher data throughput and reduced energy consumption. Two
popular techniques for reducing computation in neural networks are pruning,
removing insignificant synapses, and quantization, reducing the precision of
the calculations. In this work, we explore the interplay between pruning and
quantization during the training of neural networks for ultra low latency
applications targeting high energy physics use cases. Techniques developed for
this study have potential applications across many other domains. We study
various configurations of pruning during quantization-aware training, which we
term quantization-aware pruning, and the effect of techniques like
regularization, batch normalization, and different pruning schemes on
performance, computational complexity, and information content metrics. We find
that quantization-aware pruning yields more computationally efficient models
than either pruning or quantization alone for our task. Further,
quantization-aware pruning typically performs similar to or better in terms of
computational efficiency compared to other neural architecture search
techniques like Bayesian optimization. Surprisingly, while networks with
different training configurations can have similar performance for the
benchmark application, the information content in the network can vary
significantly, affecting its generalizability.
Comment: 22 pages, 7 figures, 1 table.
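As a rough illustration of the idea (not the authors' training recipe, and with hypothetical parameter names), the sketch below combines magnitude-based pruning with a symmetric uniform quantizer in a single forward-pass weight transform, the kind of per-step operation applied during quantization-aware pruning.

```python
import numpy as np

def prune_and_quantize(w, sparsity, n_bits):
    """Sketch of a quantization-aware-pruning forward pass: zero out the
    smallest-magnitude weights, then apply a symmetric uniform quantizer to
    the survivors (during training, gradients would flow through a
    straight-through estimator)."""
    k = int(np.ceil(sparsity * w.size))                       # weights to remove
    threshold = np.sort(np.abs(w), axis=None)[k - 1] if k > 0 else -np.inf
    mask = np.abs(w) > threshold                              # surviving synapses
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(w[mask]).max() / qmax if mask.any() else 1.0
    w_q = np.clip(np.round(w / scale), -qmax, qmax) * scale   # uniform quantization
    return w_q * mask

# Usage: 75% sparsity, 4-bit weights
rng = np.random.default_rng(0)
w = rng.normal(size=(8, 8))
w_qp = prune_and_quantize(w, sparsity=0.75, n_bits=4)
print("nonzero fraction:", float(np.mean(w_qp != 0)))         # roughly 0.25
```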
LUXOR: An FPGA Logic Cell Architecture for Efficient Compressor Tree Implementations
We propose two tiers of modifications to FPGA logic cell architecture to
deliver a variety of performance and utilization benefits with only minor area
overheads. In the first tier, we augment existing commercial logic cell
datapaths with a 6-input XOR gate in order to improve the expressiveness of
each element, while maintaining backward compatibility. This new architecture
is vendor-agnostic, and we refer to it as LUXOR. We also consider a secondary
tier of vendor-specific modifications to both Xilinx and Intel FPGAs, which we
refer to as X-LUXOR+ and I-LUXOR+ respectively. We demonstrate that compressor
tree synthesis using generalized parallel counters (GPCs) is further improved
with the proposed modifications. Using both the Intel adaptive logic module and
the Xilinx slice at the 65nm technology node for a comparative study, it is
shown that the silicon area overhead is less than 0.5% for LUXOR and 5-6% for
LUXOR+, while the delay increments are 1-6% and 3-9% respectively. We
demonstrate that LUXOR can deliver an average reduction of 13-19% in logic
utilization on micro-benchmarks from a variety of domains. BNN benchmarks
benefit the most with an average reduction of 37-47% in logic utilization,
which is due to the highly-efficient mapping of the XnorPopcount operation on
our proposed LUXOR+ logic cells.
Comment: In Proceedings of the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA'20), February 23-25, 2020, Seaside, CA, USA.
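For readers unfamiliar with compressor trees, the following Python sketch (a behavioral illustration, not the LUXOR mapping) shows how generalized parallel counters such as a (6:3) counter reduce a column of equal-weight bits, the same reduction pattern that underlies the XnorPopcount operation mentioned above.

```python
import random

def gpc_6_3(bits):
    """(6:3) generalized parallel counter: the binary sum of up to six
    equal-weight input bits, returned as output bits of weight 1, 2 and 4."""
    s = sum(bits)
    return s & 1, (s >> 1) & 1, (s >> 2) & 1

def popcount_compressor_tree(bits):
    """Sum a list of bits with a tree of (6:3) counters: each weight column is
    compressed six bits at a time, with counter outputs feeding the same and
    higher-weight columns, until every column holds at most one bit."""
    columns = {0: list(bits)}                  # weight -> bits awaiting reduction
    result, w = 0, 0
    while w in columns:
        col = columns[w]
        while len(col) > 1:
            group, col = col[:6], col[6:]
            b0, b1, b2 = gpc_6_3(group)
            col.append(b0)                     # weight-w output stays in place
            columns.setdefault(w + 1, []).append(b1)
            if len(group) >= 4:                # only 4+ inputs can carry into w+2
                columns.setdefault(w + 2, []).append(b2)
        if col:
            result += col[0] << w
        w += 1
    return result

# Usage: matches a plain popcount
random.seed(2)
bits = [random.randint(0, 1) for _ in range(50)]
assert popcount_compressor_tree(bits) == sum(bits)
```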
Accelerating Sparse Linear Algebra and Deep Neural Networks on Reconfigurable Platforms
Regardless of whether the chosen figure of merit is execution time, throughput, battery life for an embedded system or total cost of ownership for a datacenter, today’s computers are fundamentally limited by their energy efficiency. Using specialized hardware-software solutions for particular applications or domains is a well-known approach to increase energy efficiency of computing systems. Reconfigurable logic in the form of Field-Programmable Gate Arrays (FPGAs) is a particularly promising substrate for hardware specialization, owing to its runtime reconfigurability, massively parallel compute fabric and widespread availability. However, mapping computation to reconfigurable logic in a way which provides performance and efficiency benefits is a significant challenge due to the vast design space. In this thesis, we study how two particular domains can benefit from specialized architectures on reconfigurable logic. We focus on sparse linear algebra and deep neural network inference, whose execution is known to be particularly problematic on today’s general-purpose computers.
For sparse linear algebra, lack of spatial and temporal locality in memory accesses poses a fundamental problem. We address this problem by taking advantage of the flexibility of reconfigurable logic to construct specialized memory systems. We propose a hardware-software caching scheme that uses lightweight preprocessing to extract key access-pattern information from sparse matrices, offering greatly increased random-access efficiency with minimal on-chip memory usage. Furthermore, we demonstrate the broader applicability of the specialization for sparse linear algebra to graph analytics with an accelerator for breadth-first search that uses off-chip memory bandwidth more efficiently than prior work.
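The access-pattern problem can be seen in a plain CSR sparse matrix-vector product, sketched below (illustrative only, not the thesis' caching scheme): the column indices drive data-dependent gathers from the dense vector, which is exactly where spatial and temporal locality break down.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """Sparse matrix-vector product y = A @ x with A stored in CSR form.
    The gather x[col_idx[k]] is data-dependent and effectively random,
    which is the access pattern the specialized memory system targets."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            y[row] += values[k] * x[col_idx[k]]     # irregular read of x
    return y

# Usage: a tiny 3x4 sparse matrix
values  = [5.0, 2.0, 3.0, 1.0]
col_idx = [1,   3,   0,   2]
row_ptr = [0, 2, 3, 4]              # row r owns values[row_ptr[r]:row_ptr[r+1]]
x = [1.0, 2.0, 3.0, 4.0]
print(spmv_csr(values, col_idx, row_ptr, x))        # [18.0, 3.0, 3.0]
```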
For deep neural network inference, the sheer energy and hardware resource cost of floating-point computation is a fundamental limitation on energy efficiency. Exploiting recent advances in training highly quantized neural networks (QNNs), we demonstrate how FPGAs can be leveraged for accurate, energy-efficient and high-performance neural network inference. We propose the FINN framework to generate customized architectures with compute resources tailored to user-specified performance requirements while exploiting multiple levels of parallelism for high energy efficiency. We also describe mathematical simplifications for making QNN inference more resource-efficient, and show how binary matrix operators can be used as bit-serial building blocks for higher-precision computation.